We chose the NYC Open Dataset for NYPD Motor Vehicle Collisions. This is an open source dataset updated on a daily basis that tracks all road accidents in New York City. We want to use this dataset to infer trends about the various factors that cause accidents. A key question we wish to answer is Can we provide insights to help make the streets of NYC safer? Understanding the various reasons for accidents can certainly help us answer this question. For example, we want to identify the “hotspots” - intersections, streets etc. where most accidents occur. This could be a useful insight for authorities to take to necessary actions to remediate some of the issues.
In addition to this report, we have 2 interactive plots:
The github repo for the project is here.
Aakanksha Joshi - aj2771 The CTO and Editorial Director. Created the D3 interactive chart for Accidents by Borough, contributed to the Location analysis, Vision Zero analysis and executive summary.
Adarsh Chavakula - ac4244 The CEO and (Tidyverse) Advisory Committee. Contributed to the data quality analysis, Columbia neighborhood analysis and compiled the project report.
Deepak Maran - dm3308 Board Executive and Chief Design Officer. Created the D3 interactive chart for Contributing Factors over Time, contributed to the Time Analysis and executive summary.
Harsheel Singh Soin - hss2148 The COO and Design Insights Lead. Contributed to the Accident Information analysis, location analysis, heatmaps, choropleths, the Vision Zero analysis and executive summary.
The data was collected from the NYC Open Data website. It provides information on the exact location, borough, time, cause, type of vehicles, number of deaths and injuries for every accident. The data is from third quarter of 2012 to the first quarter of 2018. This data is collected by the police department (NYPD) because the NYC Council passed a local law in 2011. The data is manually run every month and reviewed by the TrafficStat unit before being posted on the NYPD website.
Each record represents a collision in NYC by borough, precinct and cross street. The data can potentially provide insights into how dangerous or safe certain intersections in NYC are. The information is available in both pdf and csv formats.
We downloaded the CSV file from the NYC Open Data webpage for this dataset. The dataset is updated on a daily basis (during weekdays). For the sake of uniformity and consistency of analyses, we used the accident data collected upto March 13, 2018.
Features: The dataset is 1.22 million rows and 29 columns (260 MB). The columns cover key aspects of every accident: Date, Time, Borough, Zip Code, Latitude, Longitude, Location, On Street Name, Cross Street Name, Number of Persons Injured, Number of Persons Killed, Number of Pedestrians Injured, Number of Pedestrians Killed, Number of Cyclist Injured, Number of Cyclist Killed, Number of Motorist Injured, Number of Motorist Killed etc.
library(tidyverse)
library(ggmap)
library(choroplethrZip)
accidents <- read_csv("NYPD_Motor_Vehicle_Collisions.csv")
names(accidents)
## [1] "DATE" "TIME"
## [3] "BOROUGH" "ZIP CODE"
## [5] "LATITUDE" "LONGITUDE"
## [7] "LOCATION" "ON STREET NAME"
## [9] "CROSS STREET NAME" "OFF STREET NAME"
## [11] "NUMBER OF PERSONS INJURED" "NUMBER OF PERSONS KILLED"
## [13] "NUMBER OF PEDESTRIANS INJURED" "NUMBER OF PEDESTRIANS KILLED"
## [15] "NUMBER OF CYCLIST INJURED" "NUMBER OF CYCLIST KILLED"
## [17] "NUMBER OF MOTORIST INJURED" "NUMBER OF MOTORIST KILLED"
## [19] "CONTRIBUTING FACTOR VEHICLE 1" "CONTRIBUTING FACTOR VEHICLE 2"
## [21] "CONTRIBUTING FACTOR VEHICLE 3" "CONTRIBUTING FACTOR VEHICLE 4"
## [23] "CONTRIBUTING FACTOR VEHICLE 5" "UNIQUE KEY"
## [25] "VEHICLE TYPE CODE 1" "VEHICLE TYPE CODE 2"
## [27] "VEHICLE TYPE CODE 3" "VEHICLE TYPE CODE 4"
## [29] "VEHICLE TYPE CODE 5"
head(accidents)
Noteworthy Features: It is interesting how much detail NYPD has gone into while recording this information about each accident, so much that upto 5 vehicles contributing to an accident are noted, along with their corresponding vehicle codes; the street intersections are noted along with exact latitude and longitude information. They’ve also recorded extra information such as the number of cyclists and pedestrians killed/injured as a result of the collision/accident.
We started with the analysis of missing values across the variables. Our first step was to use the skim function in the skimr package to get the percentage of missing values across the variables.
skimr::skim(accidents) %>% filter(stat == "missing") %>%
arrange(desc(value)) %>% select(variable, type, value) %>% mutate(percent = value/nrow(accidents))
We observe that variables like CONTRIBUTING FACTOR VEHICLE 5, VEHICLE TYPE CODE 5 etc. have the highest missing values - over 99%. These variables correspond to description of accidents involving as high as 5 vehicles, which would happen in very few cases.
It is noteworthy that variables like BOROUGH and ZIP CODE have about 28% missing values.
Since most of the observations do not have variables CONTRIBUTING FACTOR VEHICLE 5, VEHICLE TYPE CODE 5 etc, we decided to drop them. To avoid loss of useful information from this, we created a new binary variable which indicates if 3 or more vehicles were involved in an accident.
accidents <- accidents %>% mutate(VEHICLES_INVOLVED_GTE3 = 1 * !is.na(`CONTRIBUTING FACTOR VEHICLE 3`))
accidents <- accidents %>% select(-c(`VEHICLE TYPE CODE 5`,`CONTRIBUTING FACTOR VEHICLE 5`,
`VEHICLE TYPE CODE 4`,`CONTRIBUTING FACTOR VEHICLE 4`,
`VEHICLE TYPE CODE 3`,`CONTRIBUTING FACTOR VEHICLE 3`,
`OFF STREET NAME`))
Our next step is to identify if the values are missing at random. The visna function in the extracat library is useful for that.
library(extracat)
visna(accidents)
We observe that several variables do not have any missing values. It is more informative to do the visna considering a subset of the variables that have missing values.
accidents_sub <- accidents %>% select(BOROUGH,`ZIP CODE`,LOCATION,
`ON STREET NAME`, `CROSS STREET NAME`,
`CONTRIBUTING FACTOR VEHICLE 1`,`VEHICLE TYPE CODE 1`)
visna(accidents_sub)
We observe the following:
Vehicle Types and Contributing Factor are a part of all missing patterns.Borough and ZIP Code tend to be missing together.On Street and Cross Street also tend to be missing simultaneously.Based on these observations we can conclude that the variables are Missing at Random (MAR).
We wanted to see how the quality of the data has been changing over time. Our initial hypothesis for the missing values was that the data collection process might not have been great when the NYC Open Data portal started and may have improved over the years. However, the data suggests otherwise:
accidents <- accidents %>%
mutate(DATE = as.Date(DATE, format='%m/%d/%Y')) %>%
arrange(DATE) %>%
mutate(YEAR = format(DATE, "%Y")) %>%
mutate(MONTH = format(DATE, "%m"))
accidents_missing <- accidents %>%
mutate(MISSING_BOROUGH=1*is.na(BOROUGH),
MISSING_CONTRIBUTION_INFO=1*is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%
select(MISSING_BOROUGH, MISSING_CONTRIBUTION_INFO, YEAR, BOROUGH)
missing_borough_by_year <- accidents_missing %>%
group_by(YEAR) %>% summarise(pc_missing_borough=mean(MISSING_BOROUGH))
missing_contrib_by_year <- accidents_missing %>%
group_by(YEAR) %>% summarise(pc_missing_contrib=mean(MISSING_CONTRIBUTION_INFO))
missing_contrib_by_borough <- accidents_missing %>%
group_by(BOROUGH) %>% summarise(pc_missing_contrib=mean(MISSING_CONTRIBUTION_INFO))
ggplot(data=missing_borough_by_year, aes(x=YEAR, y=pc_missing_borough)) + geom_col(fill="lightblue", color="black") +
xlab("Year") + ylab("% Missing Borough Information") +
ggtitle("% Missing Borough Information vs Year") +
theme(plot.title = element_text(hjust = 0.5))
It appears that the quality of the data has been deteriorating over time. 2012 had only about 23% missing Borough information while 2017 was close to 40%.
Let’s first look at the broad level of the different NYC boroughs and see where most accidents happen:
ggplot(data=subset(accidents, !is.na(BOROUGH)), aes(BOROUGH)) + geom_bar(fill="lightblue", color="black") + xlab("Borough") + ylab('Count of Accidents') + ggtitle('Accidents by Borough (2012 - 2018)') + theme(plot.title = element_text(hjust = 0.5))
It appears that Brooklyn is the borough with the most number of accidents. This is not surprising as it is the most populated one. We can get a better understanding of this by looking at the ratio of annual accidents to the populations of these boroughs (Population data).
We also visualized Borough-Wise Distribution of Accidents from 2013-2018. This is an interactive D3 chart.
# Accidents per 100,000 in all Boroughs
borough_per_1000 <- accidents %>%
group_by(`BOROUGH`) %>%
summarise(count_accidents = n()) %>%
filter(!is.na(`BOROUGH`)) %>%
select(`BOROUGH`, count_accidents) %>%
arrange(desc(count_accidents))
pops <- c(1643734, 2629150, 1455720, 2333054, 476015)
boroughs <- c("MANHATTAN","BROOKLYN","BRONX","QUEENS","STATEN ISLAND")
pop_table <- data_frame(POPS = pops, BOROUGH = boroughs)
accidents_summary <- borough_per_1000 %>% left_join(pop_table, by="BOROUGH") %>% mutate(count_accidents = count_accidents/POPS *1000)
ggplot(data=accidents_summary, aes(x=BOROUGH, y=count_accidents)) + geom_col(fill="lightblue", color="black") + xlab("Borough") + ylab('Accidents per 1000 residents') + ggtitle('Accidents per 1000 residents by Borough') +
theme(plot.title = element_text(hjust = 0.5))
This chart reveals a different picture compared to the prior chart. Manhattan has a 25% higher number of accidents per 1000 residents compared to Brooklyn - making it more dangerous.
We looked at a more granular level of aggregation with ZIP codes to identify neighborhoods with the most accidents. These can be plotted on a map using the choroplethrZip library. This library is not available on CRAN and has to be downloaded from GitHub.
Installation Instructions:
install.packages("devtools")library(devtools)install_github('arilamstein/choroplethrZip@v1.3.0')library(choroplethrZip)
accidents_zip_other <- accidents %>% filter(!is.na(`ZIP CODE`)) %>%
group_by(`ZIP CODE`,BOROUGH) %>%
summarise(NUMBER_OF_ACCIDENTS=n())
names(accidents_zip_other) <- c("region", "BOROUGH","value")
accidents_zip_other$region <- factor(accidents_zip_other$region)
accidents_zip_other <- accidents_zip_other[!duplicated(accidents_zip_other$region),]
zip_choropleth(accidents_zip_other,
state_zoom = "new york",
county_zoom = c(36005,36047,36061,36081,36085),
title = "Accidents by Zip Code in New York City",
legend = "Number of Accidents") +
theme(plot.title = element_text(hjust = 0.5, size=15)) +
viridis::scale_fill_viridis(name="# Accidents",discrete = TRUE) + coord_map()
We can make a few quick observations from the chart:
Zooming in on Manhattan:
accidents_zip <- accidents %>% filter(!is.na(`ZIP CODE`)) %>% filter(BOROUGH=="MANHATTAN") %>% group_by(`ZIP CODE`) %>% summarise(NUMBER_OF_ACCIDENTS=n())
names(accidents_zip) <- c("region", "value")
accidents_zip$region <- factor(accidents_zip$region)
zip_choropleth(accidents_zip,
state_zoom = "new york",
county_zoom = 36061,
title = "Accidents by Zip Code in Manhattan",
legend = "Number of Accidents") +theme(plot.title = element_text(hjust = 0.5, size=15))+viridis::scale_fill_viridis(name="# Accidents",discrete = TRUE)+coord_map()
We also looked at the ZIP codes with the highest fatalities.
accidents_zip_deaths <- accidents %>% filter(!is.na(`ZIP CODE`)) %>% group_by(`ZIP CODE`,BOROUGH) %>% summarise(NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`))
names(accidents_zip_deaths) <- c("region", "BOROUGH","value")
accidents_zip_deaths$region <- factor(accidents_zip_deaths$region)
accidents_zip_deaths <- accidents_zip_deaths[!duplicated(accidents_zip_deaths$region),]
zip_choropleth(accidents_zip_deaths,
state_zoom = "new york",
county_zoom = c(36005,36047,36061,36081,36085),
title = "Deaths by Zip Code in New York City",
legend = "Number of Deaths") +theme(plot.title = element_text(hjust = 0.5, size=15))+viridis::scale_fill_viridis(name="# Deaths",discrete = TRUE)+coord_map()
The ZIP codes which have the highest accidents are the following:
zip_count <- accidents %>%
group_by(`ZIP CODE`) %>%
summarise(count_accidents = n()) %>%
filter(!is.na(`ZIP CODE`)) %>%
select(`ZIP CODE`, count_accidents) %>%
arrange(desc(count_accidents)) %>%
mutate(`ZIP CODE` = as.character(`ZIP CODE`))
zip_count_top10 <- zip_count[1:10,]
dotchart(zip_count_top10$count_accidents, labels = zip_count_top10$`ZIP CODE`,
pch=19, xlab = "Accident Count", groups = -zip_count_top10$count_accidents,
xlim=c(0,16000), ylab = "ZIP code",
main = "Accidents - Top 10 ZIP codes")
We also looked at the top 10 ZIP codes for fatalities:
zip_count_deaths <- accidents %>%
group_by(`ZIP CODE`) %>%
summarise(count_deaths = sum(`NUMBER OF PERSONS KILLED`)) %>%
filter(!is.na(`ZIP CODE`)) %>%
select(`ZIP CODE`, count_deaths) %>%
arrange(desc(count_deaths)) %>%
mutate(`ZIP CODE` = as.character(`ZIP CODE`))
zip_count_deaths_top10 <- zip_count_deaths[1:10,]
dotchart(zip_count_deaths_top10$count_deaths, labels = zip_count_deaths_top10$`ZIP CODE`,
pch=19, xlab = "Deaths", xlim=c(0,30), main = "Deaths - Top 10 ZIP codes", ylab = "ZIP code",
groups = -zip_count_deaths_top10$count_deaths)
To get a more granular view of where most accidents happen, we superimposed a heatmap on top of the NYC map to identify specific roads and interesctions with most accidents. The NYC map was obtained using the handy ggmap library.
nycmap <- qmap("new york", zoom = 11, color = "bw")
nycmap + stat_bin2d(aes(x = LONGITUDE, y = LATITUDE), size = .5, bins = 150,
alpha = 0.5, data = accidents) + viridis::scale_fill_viridis() +
ggtitle("Accident Density by Location") + theme(plot.title = element_text(hjust = 0.5))
The heatmap is lit up brightly for midtown and lower Manhattan. There are also hotspots in Brooklyn (along Atlantic Avenue) which are brightly lit. This chart illustrates why the accidents per 1000 residents is the highest for Manhattan.
“Vision Zero is a safe-streets initiative created by New York City’s Department of Transportation (DOT) in 2014, the first year of Mayor Bill de Blasio’s administration. The Vision Zero concept and program emerged in the late 1990s in Sweden, which aimed to eliminate traffic deaths and serious injuries. A number of other countries and cities subsequently adopted similar traffic safety programs; and safe-street improvements were undertaken under the administration of New York’s previous mayor, Michael Bloomberg,” reports a Nacto report from 2018.. “In NYC, Vision Zero consists of reengineering intersections and streets: it simplifies complex intersections, narrows traffic lanes, adds speed bumps and bicycle paths, shortens pedestrian-crossing distances, alters the timing of traffic lights, adds speed-detection and red-light cameras, and installs “leading pedestrian intervals” to give pedestrians a head start at a light before drivers can turn into the crosswalk,” the report further adds.
“In January 2014, Mayor Bill de Blasio announced adoption of New York City’s Vision Zero and enumerated a long list of initiatives the city would be following to reduce fatalities on city streets. Among the measures it plans to take includes pushing for changes in the State legislature to allow the city more control in the administration of traffic safety measures such as speed reduction,” writes the Wikipedia on Vision Zero.
“According to a OneNYC report from 2017, the Department of Transportation (DOT) has completed a total of 356 street improvement projects, since the start of Vision Zero, with an emphasis on the city’s most crash-prone corridors and intersections. The Great Streets initiative focuses on improving safety on four of New York City’s major arterial routes: Atlantic Avenue, Fourth Avenue, Grand Concourse, and Queens Boulevard,” reports OneNYC.
We decided to look into the accident trends on Atlantic Avenue and Queens Boulevard and see if Vision Zero had an impact.
accidents_queens <- accidents %>% filter(`ON STREET NAME`=='QUEENS BOULEVARD') %>% group_by(`YEAR`) %>% summarise(`Number of Accidents`=n()) %>%
mutate(YEAR = as.numeric(YEAR))
ggplot(accidents_queens, aes(x=YEAR, y=`Number of Accidents`)) + geom_point() +geom_smooth(method = "loess", se=FALSE) + ggtitle("Number of Accidents by Year across Queens Boulevard") + theme(plot.title = element_text(hjust = 0.5))+xlab("Year")
accidents_atlav <- accidents %>% filter(`ON STREET NAME`=='ATLANTIC AVENUE') %>% group_by(`YEAR`) %>% summarise(`Number of Accidents`=n())%>% mutate(YEAR = as.numeric(YEAR))
ggplot(accidents_atlav, aes(x=YEAR, y=`Number of Accidents`)) + geom_point() +geom_smooth(method = "loess", se=FALSE) + ggtitle("Number of Accidents by Year across Atlantic Avenue") + theme(plot.title = element_text(hjust = 0.5))+xlab("Year")
We can see that the number of accidents across two of the most accident-prone streets, Atlantic Avenue and Queens Boulevard, has gone down considerably since the introduction of Vision Zero in 2014.
We tried to understand the trend in accidents over time. We started by looking at daily accidents over time (over the span of 5 years). This is obviously expected to be noisy, so we added a trend curve using the non-parametric loess smoothing.
accidents_per_day <- accidents %>%
group_by(DATE, BOROUGH) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`),
NUMBER_OF_INJURIES = sum(`NUMBER OF PERSONS INJURED`)) %>%
filter(NUMBER_OF_ACCIDENTS<400) # Remove one outlier point
ggplot(accidents_per_day, aes(x=DATE, y=NUMBER_OF_ACCIDENTS)) + geom_line(aes(colour="Accidents")) +
geom_smooth(method = "loess", span = .2, se = TRUE, show.legend = TRUE, aes(colour="Trend")) +
scale_colour_manual(name="Legend", values=c("black", "red")) +
xlab("Time") + ylab("Number of Daily Accidents") +
ggtitle("Daily Accidents") +
theme(plot.title = element_text(hjust = 0.5))
There are a few quick observations we can make from this chart:
To better understand the seasonality aspect of the data, we looked at the number of accidents aggregated over a month level.
accidents_by_month <- accidents %>% filter(YEAR > 2012) %>% filter(YEAR < 2018) %>%
group_by(MONTH) %>% summarise(num_accidents = n())
ggplot(accidents_by_month, aes(x=MONTH, y=num_accidents)) +
geom_bar(stat="identity", fill="lightblue", color="black")+
xlab("Month") + ylab("Total Accidents") +
ggtitle("Accidents by month") +
theme(plot.title = element_text(hjust = 0.5))
The chart shows a dip in the accidents for the first few months of an year. There seem to be fewer accidents in winter. We also tried to plot the same information in a polar coordinate plot to give a better sense of the seasonality.
ggplot(accidents_by_month, aes(x=MONTH, y=num_accidents)) +
geom_bar(stat="identity", fill="lightblue", color="black")+
xlab("Month") + ylab("Total Accidents") +
ggtitle("Accidents by month") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_polar(start=0)
Besides time of year, we were interested in looking at trends by time of day. We aggregated the data by the hour and looked at the accidents. Like before, we made both the cartesian bar charts as well as the polar bar charts.
accidents_by_timeofday <- accidents %>%
mutate(HOUR = format(as.POSIXct(TIME), "%H")) %>%
group_by(HOUR, BOROUGH) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`),
NUMBER_OF_INJURIES = sum(`NUMBER OF PERSONS INJURED`))# %>%
# filter(NUMBER_OF_ACCIDENTS<400) # Remove one outlier point
ggplot(accidents_by_timeofday, aes(x=HOUR, weights=NUMBER_OF_ACCIDENTS)) +
geom_bar(stat="count", fill="lightblue", color="black") +
xlab("Time") + ylab("Accidents") +
ggtitle("Accidents by Hour") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(accidents_by_timeofday, aes(x=HOUR, weights=NUMBER_OF_ACCIDENTS)) +
geom_bar( stat="count", fill="lightblue", color="black") +
xlab("Time") + ylab("Accidents") +
ggtitle("Accidents by Hour") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_polar()
The frequency of accidents peak once during the morning (around 9 am) and once in the evening (around 5 pm). This is probably because these are the peak hours of traffic. Also the number of accidents in the peak hours of the evening is much higher than the number of accidents in the morning hours.
The above chart is a polar coordinate representation of the bar chart preceding it. The Cartesian bar chart is significantly easier to read and draw inferences from. The differneces in values are linear and intuitive in the Cartesian chart. However it fails to capture the cyclical nature of months (December looping over to January) and time of day (23 to 0) which is better captured by the Polar bar chart.
We analyzed the different contributing factors leading to accidents over the years. This is captured by the CONTRIBUTING FACTOR VEHICLE 1 and CONTRIBUTING FACTOR VEHICLE 2 variables. They correspond to the reason for Vehicle 1 and Vehicle 2 respectively. These variables have a large number of missing values:
print("Number of missing values in `CONTRIBUTING FACTOR VEHICLE 1`")
## [1] "Number of missing values in `CONTRIBUTING FACTOR VEHICLE 1`"
print(sum(is.na(accidents$`CONTRIBUTING FACTOR VEHICLE 1`)))
## [1] 6297
print("Number of missing values in `CONTRIBUTING FACTOR VEHICLE 2`")
## [1] "Number of missing values in `CONTRIBUTING FACTOR VEHICLE 2`"
print(sum(is.na(accidents$`CONTRIBUTING FACTOR VEHICLE 2`)))
## [1] 170620
CONTRIBUTING FACTOR VEHICLE 1 has much fewer missing values than CONTRIBUTING FACTOR VEHICLE 2. Analyzing them separately for the purpose of better analyses:
checkContrFact1_2 <- accidents %>%
filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`) & !is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>%
filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified" & `CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>%
mutate(sameValue = `CONTRIBUTING FACTOR VEHICLE 1`==`CONTRIBUTING FACTOR VEHICLE 2`)
print("Number of records where `CONTRIBUTING FACTOR VEHICLE 1` equals `CONTRIBUTING FACTOR VEHICLE 2`:")
## [1] "Number of records where `CONTRIBUTING FACTOR VEHICLE 1` equals `CONTRIBUTING FACTOR VEHICLE 2`:"
print(sum(checkContrFact1_2$sameValue))
## [1] 98407
Upon analyzing CONTRIBUTING FACTOR VEHICLE 1 and CONTRIBUTING FACTOR VEHICLE 2 columns after omitting records containing NAs and only considering records where both CONTRIBUTING FACTOR VEHICLE 1 and CONTRIBUTING FACTOR VEHICLE 2 are specified, we observe that 98407 records out of 160161 have the same entry for these two columns.
We plot the top 10 contributing factors for Vehicle 1.
accidents_by_reason_1 <- accidents %>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%
filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified") %>%
group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
accidents_by_reason_1 <- head(accidents_by_reason_1, n = 10)
ggplot(accidents_by_reason_1, aes(x=reorder(`CONTRIBUTING FACTOR VEHICLE 1`, `NUMBER_OF_ACCIDENTS`),
y=`NUMBER_OF_ACCIDENTS`)) +
geom_col(fill="lightblue", color="black") +
xlab("Contributing Factor Vehicle 1") +
ylab("Number of Accidents")+
ggtitle("Top 10 First Contributing Factors by Accident Count")+
theme(plot.title = element_text(hjust = 0.5))+coord_flip()
accidents_by_reason_2 <- accidents %>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>%
filter(`CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>%
group_by(`CONTRIBUTING FACTOR VEHICLE 2`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
accidents_by_reason_2 <- head(accidents_by_reason_2, n = 10)
ggplot(accidents_by_reason_2, aes(x=reorder(`CONTRIBUTING FACTOR VEHICLE 2`, `NUMBER_OF_ACCIDENTS`),
y=`NUMBER_OF_ACCIDENTS`)) +
geom_col(fill="lightblue", color="black") +
xlab("Contributing Factor Vehicle 2")+ylab("Number of Accidents")+
ggtitle("Top 10 Second Contributing Factors by Accident Count")+
theme(plot.title = element_text(hjust = 0.5))+coord_flip()
We used a heatmap to understand the combination of contributing factors for Vehicle 1 and contributing factors for Vehicle 2 resulting in most accidents.
accidents_by_vehicles <- accidents %>%
filter(!is.na(`VEHICLE TYPE CODE 1`) & !is.na(`VEHICLE TYPE CODE 2`)) %>%
filter(`VEHICLE TYPE CODE 1` != "UNKNOWN" & `VEHICLE TYPE CODE 2` != "UNKNOWN") %>%
group_by(`VEHICLE TYPE CODE 1`,`VEHICLE TYPE CODE 2`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
accidents_by_vehicles <- head(accidents_by_vehicles, n = 50)
ggplot(accidents_by_vehicles, aes(x = `VEHICLE TYPE CODE 1`, y = `VEHICLE TYPE CODE 2`, fill = NUMBER_OF_ACCIDENTS)) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
geom_tile()+xlab("First Vehicle Type")+ylab("Second Vehicle Type")+
ggtitle("Number of Accidents by combinations of Vehicle Types")+
theme(plot.title = element_text(size=18, hjust = 0.5))+
viridis::scale_fill_viridis()
We observe that most accidents occur when one or more drivers of the vehicles involved were distracted/inattentive.
We looked at Time Variation of Contributing Factors from 2013-2017. This is an interactive D3 chart.
We wanted to observe which locations in NYC were more likely to see the particular contributing factors we noted above. For this we focused only on Manhattan to get a closer view. We also looked only at the top 5 contributing factors.
accidents_by_reasons <- accidents %>%
filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`) & !is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>%
filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified" & `CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>%
group_by(`CONTRIBUTING FACTOR VEHICLE 1`,`CONTRIBUTING FACTOR VEHICLE 2`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
accidents_by_reasons_map <- head(accidents_by_reasons, n=5)
accidents_topCF12 <- accidents %>%
filter(`CONTRIBUTING FACTOR VEHICLE 1` %in% accidents_by_reasons_map$`CONTRIBUTING FACTOR VEHICLE 1` & `CONTRIBUTING FACTOR VEHICLE 2` %in% accidents_by_reasons_map$`CONTRIBUTING FACTOR VEHICLE 2`) %>%
filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>% filter(BOROUGH=='MANHATTAN')
qmap('manhattan', zoom = 12,color = 'bw', legend = 'topleft') +
geom_point(aes(x=LONGITUDE, y=LATITUDE, color=`CONTRIBUTING FACTOR VEHICLE 1`),
data=accidents_topCF12,alpha=0.6, size=1.5) +
viridis::scale_color_viridis(discrete=TRUE)+
ggtitle("Spatial View of Accidents by Top 5 Contributing Factors in Manhattan")+
theme(plot.title = element_text(hjust = 0.5, size=15))
We observe an unusual pattern here - the different contributing factors have different “hotspots” across Manhattan. This is better illustrated by using 2D density plots faceted on contributing factor.
qmap('manhattan', zoom = 13,color = 'bw', legend = 'topleft') +
geom_point(aes(x=LONGITUDE, y=LATITUDE, color=`NUMBER OF PERSONS KILLED`),
data=accidents_topCF12,alpha=0.7, size=1.0) + viridis::scale_color_viridis()+ggtitle("Spatial View of Accidents by Top 5 Contributing Factors in Manhattan")+
theme(plot.title = element_text(hjust = 0.5, size=28))+
geom_density_2d(aes(x=LONGITUDE, y=LATITUDE, color=`NUMBER OF PERSONS KILLED`),
data=accidents_topCF12, color="red")+
facet_wrap(~ `CONTRIBUTING FACTOR VEHICLE 1`, nrow=3)+
theme(panel.spacing.x=unit(2, "lines"),panel.spacing.y=unit(2, "lines"))+
theme(strip.text.x = element_text(size = 25))